The Challenge

The VAST Challenge was designed to test the skills of visual analytics researchers and developers. The VAST Challenge was composed of three mini challenges; my colleague Jean-Carlos Paredes and I worked together on the first. The main text of the first challenge was as follows:

“As part of his investigation, ornithology student Mitch Vogel needs to examine the movement of traffic through the Boonsong Lekagul Nature Preserve. His first working hypothesis is that there is some link between the traffic going through the preserve and the decline in the nesting Rose-crested Blue Pipit - maybe the traffic noises are drowning out mating calls! Or perhaps he can discover some odd goings on in the traffic patterns - perhaps campers are invading the bird’s habitat areas?”

Our understanding was: expect to find something weird, and relate this to birds.

The Data

The data set had 171,477 rows and 4 columns: Timestamp, Vehicle ID, Vehicle Type, and Location (one of the labeled points on the map). The data were collected over a period of 13 months, from 5/1/2015 to 5/31/2016.

Data Snippet
2015-05-01 01:12:42,20151201011242-330,5,entrance0
2015-05-01 01:14:22,20151201011242-330,5,general-gate1
2015-05-01 01:17:13,20151201011242-330,5,ranger-stop2

And a note on vehicle types.
Vehicle Type   Description
1              2 axle car (or motorcycle)
2              2 axle truck
2P             2 axle truck (Park Vehicle)
3              3 axle truck
4              4 axle (and above) truck
5              2 axle bus
6              3 axle bus


A map of the fictional park was provided for context.

Feature Extraction

We wrote Python code that uses the timestamps to generate, for each vehicle, a path through the park of the following form: {Node, time-delta, Node, time-delta, … , Node}.
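A minimal sketch of this step (not the original code), assuming the column order seen in the data snippet: timestamp, vehicle ID, vehicle type, location. The function name `build_paths` and the choice of seconds as the time-delta unit are illustrative assumptions.

```python
import csv
from collections import defaultdict
from datetime import datetime
from io import StringIO

# Rows copied from the data snippet above. The column order is assumed to be
# timestamp, vehicle ID, vehicle type, location.
SAMPLE = """\
2015-05-01 01:12:42,20151201011242-330,5,entrance0
2015-05-01 01:14:22,20151201011242-330,5,general-gate1
2015-05-01 01:17:13,20151201011242-330,5,ranger-stop2
"""

def build_paths(lines):
    """Return {vehicle_id: [node, seconds, node, seconds, ..., node]}."""
    sightings = defaultdict(list)
    for timestamp, vehicle_id, _vehicle_type, location in csv.reader(lines):
        when = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S")
        sightings[vehicle_id].append((when, location))

    paths = {}
    for vehicle_id, seen in sightings.items():
        seen.sort()  # put each vehicle's sightings in chronological order
        path = [seen[0][1]]
        for (t_prev, _), (t_next, node) in zip(seen, seen[1:]):
            path.append((t_next - t_prev).total_seconds())
            path.append(node)
        paths[vehicle_id] = path
    return paths

paths = build_paths(StringIO(SAMPLE))
print(paths["20151201011242-330"])
```

For the three sample rows, the resulting path alternates the three locations with the two gaps between sightings (100 and 171 seconds).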

Then, using these paths, we computed the population of each of the campsites at the time the vehicle entered the park (do new campers avoid high population campsites?). This was an array of 9 numbers, one integer for each campsite.
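As a rough illustration of this feature (a sketch, not the original code), suppose each campsite stay has already been reduced to a tuple of campsite index, arrival time, and departure time. That tuple layout, the name `populations_at`, and the sample stays are all hypothetical. Note that the counts here are vehicles, not people.

```python
from datetime import datetime

N_CAMPSITES = 9  # one counter per campsite, as described above

def populations_at(stays, moment):
    """Count the vehicles present at each campsite at `moment`.

    `stays` is a list of (campsite_index, arrived, departed) tuples;
    this layout is an assumption made for the sketch.
    """
    counts = [0] * N_CAMPSITES
    for campsite, arrived, departed in stays:
        if arrived <= moment < departed:
            counts[campsite] += 1
    return counts

# Made-up stays, for illustration only.
stays = [
    (5, datetime(2015, 5, 1, 9), datetime(2015, 5, 2, 9)),
    (5, datetime(2015, 5, 1, 12), datetime(2015, 5, 3, 8)),
    (2, datetime(2015, 5, 2, 10), datetime(2015, 5, 2, 22)),
]
entry_time = datetime(2015, 5, 1, 13)
feature = populations_at(stays, entry_time)
print(feature)
```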

Data Validation

To ensure the validity of our extracted features, we modeled the park as a network, and created population-over-time plots to catch any obvious errors.

Park As A Network

What if bird populations are plummeting because someone has taken a fancy to off-road shenanigans? The park was small enough that we could draw out the connections by hand.

It looks like a cute bug to me! Its “body” and “eyes” are completely connected subgraphs with their nodes listed above. Other nodes are labeled.

We wrote this out as an adjacency matrix in Excel, and then used Python code to check that paths never skipped past nodes in ways they shouldn’t. We found no evidence of off-road behavior.
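The check can be sketched roughly as follows. This is an illustration, not the original code: the `ADJACENT` dictionary is a tiny hypothetical fragment standing in for the full adjacency matrix, and treating a repeated node (the same sensor twice in a row) as legal is an assumption.

```python
# Hypothetical fragment of the park graph; the real adjacency matrix has one
# row and column per labeled point on the map.
ADJACENT = {
    "entrance0": {"general-gate1"},
    "general-gate1": {"entrance0", "ranger-stop2"},
    "ranger-stop2": {"general-gate1"},
}

def skipped_edges(path, adjacent):
    """Return consecutive node pairs in `path` that are not connected."""
    nodes = path[::2]  # paths alternate node, time-delta, node, ...
    return [
        (a, b)
        for a, b in zip(nodes, nodes[1:])
        # Assumption: the same sensor twice in a row is not a skipped node.
        if a != b and b not in adjacent.get(a, set())
    ]

on_road = ["entrance0", 100.0, "general-gate1", 171.0, "ranger-stop2"]
off_road = ["entrance0", 60.0, "ranger-stop2"]
print(skipped_edges(on_road, ADJACENT))
print(skipped_edges(off_road, ADJACENT))
```

An empty list for every vehicle is what "no evidence of off-road behavior" looks like under this check.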

Campsite Populations Over Time

The only way that occurred to us to validate the campsite-population-over-time plots was to check these three criteria:

  1. Do not exceed the total trend
  2. Start and end at (or close to) zero
  3. Never go negative
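We checked these criteria by eye on the plots, but they could also be checked programmatically. A minimal sketch, assuming each series is a list of counts sampled at the same times; the function name and the `tolerance` parameter are illustrative:

```python
def check_population_series(campsite, total, tolerance=0):
    """Return the list of criteria that a campsite series violates."""
    problems = []
    # 1. The campsite count should never exceed the park-wide count.
    if any(c > t for c, t in zip(campsite, total)):
        problems.append("exceeds total trend")
    # 2. The series should start and end at (or close to) zero.
    if abs(campsite[0]) > tolerance or abs(campsite[-1]) > tolerance:
        problems.append("does not start and end near zero")
    # 3. A population can never be negative.
    if min(campsite) < 0:
        problems.append("goes negative")
    return problems

# Made-up series, for illustration only.
total = [0, 3, 5, 4, 0]
plausible = [0, 1, 2, 1, 0]
broken = [0, 4, -1, 1, 2]
print(check_population_series(plausible, total))
print(check_population_series(broken, total))
```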


All Park Activity Over Time

Individual Campsite Populations Over Time (Campsites 0 - 8, left to right & top to bottom)

These plots don’t lead us to reject our extracted feature.

They also show us two things: Campsite #1 was extremely unpopular relative to the others, and also there does not seem to be a maximum number of people that can stay at a campsite at one time.

Campsite #5 reached a record population count of 74. If there were some kind of wait-list phenomenon, we would expect to see it sustain that value. Even if people don’t wait around for an open spot, if there were a limited number of open spots we would expect to see that maximum value reached many times (or at least more than once).

From these data, we cannot conclude whether or not these campsites had maximum capacities. Furthermore, even if they did, those capacities do not seem to have been reached during the period these data were recorded.

Classification

Using the type of vehicle and the paths, we were able to identify a few types of behavior:

Class Name      Number Observed   Percent of Observations
Thru Traffic    8008              42.83%
Campers         6513              34.84%
Park Rangers    998               5.33%
Mystery Car     1                 0.00%
Unclassified    3176              16.98%
Total           18696             100.00%


Thru Traffic

A little more than 40% of all vehicles moved through the park without significant stops. (The longest any thru-traffic vehicle spent between gates was 26 minutes.)
A typical traffic route is shown below. Each route is shown on additional plots here.


Campers

About 35% of paths went to a campsite, stayed there for a significant amount of time (we arbitrarily chose 12 hours as the threshold), and then left the park. In the example shown, the vehicle left along the same path it came in on, but this was not always the case.


Park Rangers

Park rangers were identified by their vehicle type (2P), and frequently took a thorough tour of the park along a patrol-like route. They often started from and returned to the ranger base, but this was not always the case.
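The three rules above can be sketched as a single function. This is an illustration under stated assumptions, not the original code: the 12-hour camper threshold and the 2P vehicle type come from the text, while the 30-minute limit on any single leg for thru traffic, the `camping` name prefix for campsite nodes, and the node name `entrance1` are hypothetical.

```python
TWELVE_HOURS = 12 * 60 * 60  # camper threshold from the text, in seconds
MAX_THRU_LEG = 30 * 60       # hypothetical cutoff for a "significant stop"

def classify(vehicle_type, path, campsite_prefix="camping"):
    """Label one vehicle from its type and its node/time-delta path."""
    if vehicle_type == "2P":
        return "Park Ranger"
    nodes, deltas = path[::2], path[1::2]
    # A stay shows up as the same campsite node twice in a row, separated
    # by a long time-delta (an assumption about how stays are recorded).
    for a, b, dt in zip(nodes, nodes[1:], deltas):
        if a == b and a.startswith(campsite_prefix) and dt >= TWELVE_HOURS:
            return "Camper"
    if deltas and max(deltas) <= MAX_THRU_LEG:
        return "Thru Traffic"
    return "Unclassified"

thru = ["entrance0", 600.0, "general-gate1", 500.0, "entrance1"]
camper = ["entrance0", 900.0, "camping5", 50000.0, "camping5", 900.0, "entrance0"]
print(classify("1", thru), classify("1", camper), classify("2P", thru))
```

Anything that fits none of the rules falls through to "Unclassified", which is how a residual class like the one in the table arises.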


The Mystery Vehicle

One vehicle stood out sharply against the others. Car ID 20155705025759-63 entered the park on June 5, 2015 around 3pm (about a month after data collection began) and remained in the park for at least 360 days, until data collection stopped. The car moved between campsites and would stay at each for about 30 days before moving to the next. The car did not visit campsites 1, 7, or 8, but visited all the others at least twice. Whatever this car is doing, it is certainly anomalous. If I could advise the park rangers within a month of when they stopped data collection, I would tell them to investigate campsite #5, where the vehicle was last seen!


Do campers avoid busy campsites?

To answer this question, we fit one logistic regression model per campsite (that campsite vs. all others as the response), using all campsite populations at the time of entry as predictors.
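To make the setup concrete, here is a small sketch of how the one-vs-rest responses could be built from the campsite each camper chose. It covers only the response construction, not the model fitting; the function name and the sample choices are illustrative.

```python
N_CAMPSITES = 9

def one_vs_rest_responses(chosen):
    """Return one 0/1 response vector per campsite.

    `chosen[i]` is the index of the campsite that camper i stayed at.
    Vector k is 1 where the camper chose campsite k and 0 otherwise.
    """
    return [[1 if c == k else 0 for c in chosen] for k in range(N_CAMPSITES)]

chosen = [5, 2, 5, 0]  # made-up choices, for illustration only
responses = one_vs_rest_responses(chosen)
print(responses[5])
```

Each of these vectors is then the response for one model, with the same population array as predictors in every model.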

Model Evaluation

The p-values for these models are displayed in a heatmap. (Note: 1 = C0/Not C0 as response, 2 = C1/Not C1, etc.)


The intercept here is irrelevant. If there is any activity in the park, it must be in one of the response classes. Intercept aside, there are a few p-values < 0.05. Since we tested 64 relationships, a few low p-values are to be expected by chance. This chart is a good piece of evidence that campsite populations do not affect the choice of campsite. But just for grins and giggles, we evaluated the predictive performance of these models anyway.
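The "expected by chance" claim can be made concrete with a little arithmetic. The sketch below assumes the 64 tests are independent and that no real relationship exists, which is a simplification:

```python
ALPHA = 0.05   # significance threshold used above
N_TESTS = 64   # number of relationships tested

# Expected number of p-values below ALPHA if every null hypothesis is true.
expected_false_positives = ALPHA * N_TESTS

# Probability of at least one such p-value, assuming independent tests.
p_at_least_one = 1 - (1 - ALPHA) ** N_TESTS

print(f"{expected_false_positives:.1f} expected, P(>=1) = {p_at_least_one:.2f}")
```

Under those assumptions we would expect about three p-values below 0.05 from noise alone, and seeing at least one is close to certain.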

Prediction Performance

The prediction value for each observation was a float between 0 and 1. To turn these values into classifications, we had to choose cutoff values. We used the proportions of classes in the training data to choose a quantile as the cutoff for each model. For example, about 20% of all campsite stays in the training data were at campsite #5. It made intuitive sense that the top 20% of prediction values from the campsite #5 model should be classified as campsite #5.
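A minimal sketch of this cutoff rule, with made-up prediction values. The function name is illustrative, and ties and interpolation between ranks are ignored for simplicity:

```python
def quantile_cutoff(scores, class_share):
    """Return the score at or above which the top `class_share` of scores fall."""
    ranked = sorted(scores, reverse=True)
    n_positive = max(1, round(class_share * len(ranked)))
    return ranked[n_positive - 1]

# Made-up prediction values from one campsite's model.
scores = [0.05, 0.10, 0.12, 0.20, 0.22, 0.30, 0.35, 0.40, 0.55, 0.90]
cutoff = quantile_cutoff(scores, 0.20)  # about 20% of stays were campsite #5
labels = [s >= cutoff for s in scores]
print(cutoff, sum(labels))
```

Because each model gets its own cutoff, nothing forces an observation to land in exactly one class, which is where the unclassified and multi-classified observations come from.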

Looking at all the predictions produced by one of our models, we saw some pretty wonky distributions (see additional plots). If this model were more promising, it might have made sense to account for the non-normality of these distributions when choosing percentiles.

Our final evaluation of the model was simplistic. We used the predictions from each model to list the overall outcomes: some observations remained unclassified, while other observations had several classifications. These conflicts could have been resolved by comparing the prediction values.

We continue our analysis generously. Perhaps there is value if at least one classification of the multi-classified observations is correct. (It can help to have your options narrowed.) What follows is our confusion matrix, followed by the actual and generous percent accuracies.

Actual Number Correct: 90
Actual Accuracy: 6.91%
Generous Number Correct: 132
Generous Accuracy: 10.13%
Accuracy of Random Guess: 11.11%
Accuracy by always guessing most common class: 20.34% (C5)
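The two ways of counting can be sketched as follows. This reflects one reading of the text, not the original code: "actual" is taken to mean that an observation received exactly one classification and it was correct, and "generous" that the correct campsite appears among its classifications. The data are made up.

```python
def score(predicted_sets, actual):
    """Return (strict accuracy, generous accuracy).

    predicted_sets[i] is the set of campsites whose model flagged
    observation i; it may be empty or contain several campsites.
    """
    pairs = list(zip(predicted_sets, actual))
    strict = sum(1 for p, a in pairs if p == {a})   # single, correct label
    generous = sum(1 for p, a in pairs if a in p)   # correct label among many
    return strict / len(actual), generous / len(actual)

predicted_sets = [{5}, {2, 5}, set(), {3}]  # illustration only
actual = [5, 5, 1, 4]
strict_acc, generous_acc = score(predicted_sets, actual)
print(strict_acc, generous_acc)
```

The random-guess baseline is simply one in nine campsites, which is the 11.11% figure above.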

Conclusions

What is happening to the birds? The mystery car (ID: 20155705025759-63) definitely merits investigation, especially if its favorite spots are part of the birds’ territory. There also seems to be a lot of thru traffic for a nature preserve: around 43% of all activity. We also confirmed there was no off-road behavior, so it’s probably safe to confine the investigation to the visitors on the roads. These are the certain fruits of our analysis.


Next Steps

But we left many questions unanswered. If we were to continue our work, we might follow one of these routes: